As the number of COVID-19 cases soars to unprecedented heights around the United States, public health experts and many political figures continue to emphasize mask wearing as one of the most effective ways to slow the spread of the pandemic. But, as a New York Times survey from July 2020 shows, mask wearing adherence varies widely in counties around the nation. What predictors might explain this variation in mask wearing, and how might public health officials use this information to develop more effective mask-wearing interventions? To what extent can mask wearing predict the spread of the virus on a county level?
To address these questions, we plan to create two models: one to predict mask-wearing adherence by county based on a variety of county and state-wide predictors, and one to predict the spread of coronavirus in a county based on mask-wearing. Our data about mask-wearing (which is our outcome in the first model and a predictor in the second model) is from the aforementioned New York Times survey, which was conducted by the survey firm Dynata on behalf of the Times from July 2 to July 14. Aggregated at the county level, it sorts 250,000 individual responses into 3,000 U.S. counties (suggesting that a mixed effects model will likely be a useful approach). The survey asked respondents how often they wore a mask (choices were always, frequently, sometimes, rarely, or never) and presents the percentage of people who gave each answer for every county, which we combined into a single weighted average representing the probability that a randomly selected person is wearing a mask in the county.
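As a minimal sketch of this aggregation, the weighted average could be computed in R as shown here. The two rows are made-up illustrative values, not survey results, and the sketch assumes the five response columns are stored as proportions that sum to 1 within each county.

```r
# Hypothetical toy data: two counties with made-up response shares (not survey results).
mask_use <- data.frame(
  countyfp   = c("00001", "00002"),  # placeholder FIPS codes
  never      = c(0.05, 0.20),
  rarely     = c(0.05, 0.10),
  sometimes  = c(0.10, 0.20),
  frequently = c(0.20, 0.20),
  always     = c(0.60, 0.30)
)

# Weight each answer by an assumed probability of wearing a mask at a given moment.
mask_use$pct_mask <- with(
  mask_use,
  1 * always + 0.75 * frequently + 0.5 * sometimes + 0.25 * rarely + 0 * never
)

print(mask_use[, c("countyfp", "pct_mask")])
```

The weights treat the five answers as evenly spaced on a 0 to 1 scale, which is a modeling choice rather than something the survey itself reports.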
Our predictor variables were compiled from a variety of sources and joined with the mask-wearing data by county FIPS code. We included gender, political party, education, and age statistics at the county level, as all of these demographics have been shown to differ in mask-wearing frequency in prior surveys, with political party being especially significant. Other data, such as a poll from the Pew Research Center, have suggested that mask wearing varies by race. This, combined with the fact that the pandemic has disproportionately impacted communities of color according to the CDC, motivated us to include variables about the racial composition of counties in our baseline model. Researchers at the National Institutes of Health have suggested that age and location (i.e., rural vs. urban setting) likewise affect mask-wearing behavior, so we included the percentage of seniors in a county (since COVID-19 most severely affects the elderly) and various measures of population density in our mask-wearing model. Finally, we wanted to look beyond county demographics and determine whether coronavirus-related measures, including the number of cases and deaths, the growth rate of the virus at the time of the survey, and local and statewide mask mandates, explained any of the variation in mask wearing by county.
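The join by FIPS code can be sketched as follows. The two small tables and their values are hypothetical stand-ins for the real sources, and base R's `merge()` is used here only to keep the example dependency-free (in the tidyverse, `dplyr::left_join()` plays the same role).

```r
# Hypothetical toy tables. countyfp is stored as character so that
# leading zeros in the FIPS code are not lost.
mask_data <- data.frame(
  countyfp = c("01001", "01003", "01005"),
  pct_mask = c(0.70, 0.65, 0.80),
  stringsAsFactors = FALSE
)
demographics <- data.frame(
  countyfp    = c("01001", "01005"),
  pct_seniors = c(15.5, 19.0),
  stringsAsFactors = FALSE
)

# Left join: keep every county in the mask data, even when a predictor is missing.
joined <- merge(mask_data, demographics, by = "countyfp", all.x = TRUE)

print(joined)
```

Counties that appear in the mask data but not in a predictor table end up with `NA` for that predictor, which is one way missing values can enter a compiled dataset.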
For a complete list of the variables in our clean and compiled dataset and their sources, see the table below.
Variable Names and Descriptions

| Name | Description | Source | Source URL |
|---|---|---|---|
| countyfp | County-level FIPS (Federal Information Processing Standards) code. Unique for each American county. | New York Times | https://github.com/nytimes/covid-19-data/blob/master/mask-use/mask-use-by-county.csv |
| county_name | Name of the county | NA | NA |
| state | State the county is located in | NA | NA |
| pct_mask | An aggregate variable representing the probability that a randomly selected person in a county will wear a mask. Calculated as `1*always + 0.75*frequently + 0.5*sometimes + 0.25*rarely + 0*never` | NA | NA |
| always | Percentage of people who answered they "always" wear a mask | New York Times | https://github.com/nytimes/covid-19-data/blob/master/mask-use/mask-use-by-county.csv |
| frequently | Percentage of people who answered they "frequently" wear a mask | New York Times | https://github.com/nytimes/covid-19-data/blob/master/mask-use/mask-use-by-county.csv |
| sometimes | Percentage of people who answered they "sometimes" wear a mask | New York Times | https://github.com/nytimes/covid-19-data/blob/master/mask-use/mask-use-by-county.csv |
| rarely | Percentage of people who answered they "rarely" wear a mask | New York Times | https://github.com/nytimes/covid-19-data/blob/master/mask-use/mask-use-by-county.csv |
| never | Percentage of people who answered they "never" wear a mask | New York Times | https://github.com/nytimes/covid-19-data/blob/master/mask-use/mask-use-by-county.csv |
| cases_02 | Number of COVID-19 cases on 07/02/2020 | New York Times | https://github.com/nytimes/covid-19-data |
| deaths_02 | Number of COVID-19 deaths on 07/02/2020 | New York Times | https://github.com/nytimes/covid-19-data |
| cases_14 | Number of COVID-19 cases on 07/14/2020 | New York Times | https://github.com/nytimes/covid-19-data |
| deaths_14 | Number of COVID-19 deaths on 07/14/2020 | New York Times | https://github.com/nytimes/covid-19-data |
| cases_27 | Number of COVID-19 cases on 07/27/2020 | New York Times | https://github.com/nytimes/covid-19-data |
| deaths_27 | Number of COVID-19 deaths on 07/27/2020 | New York Times | https://github.com/nytimes/covid-19-data |
| case_growth_1 | cases_14/cases_02 | NA | NA |
| case_growth_2 | cases_27/cases_14 | NA | NA |
| pop_2019 | Population estimate in 2019 | United States Census Bureau | https://www.census.gov/newsroom/press-kits/2019/national-state-estimates.html |
| ru_continuum | 1 to 10 rating on the Rural-Urban Continuum | United States Census Bureau | https://www.census.gov/newsroom/press-kits/2019/national-state-estimates.html |
| density | Population density of the county | county_level_election.csv from class | NA |
| pct_less_than_hs | Percent of adults with less than a high school diploma, 2014-18 | 2014-18 American Community Survey | https://www.ers.usda.gov/data-products/county-level-data-sets/download-data/ |
| pct_hs | Percent of adults with a high school diploma only, 2014-18 | 2014-18 American Community Survey | https://www.ers.usda.gov/data-products/county-level-data-sets/download-data/ |
| pct_some_college | Percent of adults completing some college or associate's degree, 2014-18 | 2014-18 American Community Survey | https://www.ers.usda.gov/data-products/county-level-data-sets/download-data/ |
| pct_college | Percent of adults with a bachelor's degree or higher, 2014-18 | 2014-18 American Community Survey | https://www.ers.usda.gov/data-products/county-level-data-sets/download-data/ |
| pct_poverty | Percentage of people estimated to be living in poverty in 2018 | U.S. Census Bureau, Small Area Income and Poverty Estimates (SAIPE) Program | https://www.ers.usda.gov/data-products/county-level-data-sets/download-data/ |
| pct_female | Percentage of females in county, 2019 | U.S. Census Bureau | https://www.census.gov/newsroom/press-kits/2020/population-estimates-detailed.html |
| pct_black | Percentage of Black/African-American residents in county, 2019 | U.S. Census Bureau | https://www.census.gov/newsroom/press-kits/2020/population-estimates-detailed.html |
| pct_native | Percentage of American Indian or Alaskan Native people in county, 2019 | U.S. Census Bureau | https://www.census.gov/newsroom/press-kits/2020/population-estimates-detailed.html |
| pct_hispanic | Percentage of Hispanic people in county, 2019 | U.S. Census Bureau | https://www.census.gov/newsroom/press-kits/2020/population-estimates-detailed.html |
| pct_seniors | Percentage of adults 65 or over in county, 2019 | U.S. Census Bureau | https://www.census.gov/newsroom/press-kits/2020/population-estimates-detailed.html |
| pct_trump_2016 | Percentage of county who voted for Donald Trump in 2016 | county_level_election.csv from class | NA |
| pct_trump_2020 | Percentage of county who voted for Donald Trump in 2020 | Scraped by GitHub user tonmcg from Fox News, Politico, and New York Times | https://github.com/tonmcg/US_County_Level_Election_Results_08-20 |
| dem_governor | Dummy variable coded 1 if the state has a Democratic governor | National Governors Association | https://www.nga.org/wp-content/uploads/2019/07/Governors-Roster.pdf |
| state_mandate | Dummy variable coded 1 if a statewide mask mandate was enacted before 07/14/2020 | Axios | https://www.axios.com/states-face-coverings-mandatory-a0e2fe35-5b7b-458e-9d28-3f6cdb1032fb.html |
| county_mandate | Dummy variable coded 1 if a county-wide mask mandate was enacted before 07/14/2020 | Harris Institute of Public Policy | https://www.austinlwright.com/covid-research |
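The two growth variables in the table are simple ratios of cumulative case counts. A minimal sketch, using made-up counts for three hypothetical counties:

```r
# Hypothetical cumulative case counts (illustrative values only).
covid <- data.frame(
  countyfp = c("00001", "00002", "00003"),
  cases_02 = c(100, 40, NA),   # NA: no count available on 07/02
  cases_14 = c(150, 80, 10),
  cases_27 = c(300, 100, 25)
)

# Growth over each window, defined as the ratio of later to earlier counts.
covid$case_growth_1 <- covid$cases_14 / covid$cases_02
covid$case_growth_2 <- covid$cases_27 / covid$cases_14

print(covid)
```

A missing count on either date propagates to the ratio as `NA`, and a county with zero earlier cases would produce `Inf` or `NaN`, so such rows would need separate handling before modeling.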
First, we wanted to make sure that our response variable pct_mask is approximately normally distributed. Based on the following histogram, the dis
Other variables that needed to be log transformed were pct_seniors, pct_poverty, and all individual race/ethnicity categories. pct_hs did not need to be log transformed, and pct_female looked skewed both with and without a transformation, so we left it untransformed.
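One rough way to back up a visual judgment about skew is to compare sample skewness before and after the log transformation. The sketch below uses simulated right-skewed data as a stand-in for a predictor; the `skewness()` helper is a hypothetical convenience function written for this example, not part of base R.

```r
set.seed(1)

# Simulated right-skewed values standing in for a county-level predictor.
x <- rlnorm(3000, meanlog = 2, sdlog = 0.8)

# Simple moment-based sample skewness (hypothetical helper for this sketch).
skewness <- function(v) {
  v <- v[!is.na(v)]
  mean((v - mean(v))^3) / sd(v)^3
}

skew_raw <- skewness(x)
skew_log <- skewness(log(x))

cat("skewness before log:", round(skew_raw, 2), "\n")
cat("skewness after log: ", round(skew_log, 2), "\n")
```

Plotting `hist(x)` next to `hist(log(x))` shows the same thing visually: a skewness near 0 after the transformation suggests the log scale is more symmetric.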
We also have some missing values in our dataset, which we will need to decide how to impute.
sapply(clean_data_2, function(x) sum(is.na(x)))
## countyfp county_name state pct_mask
## 0 30 0 0
## always frequently sometimes rarely
## 0 0 0 0
## never cases_02 deaths_02 cases_14
## 0 97 97 59
## deaths_14 cases_27 deaths_27 case_growth_1
## 59 42 42 97
## case_growth_2 pop_2019 ru_continuum density
## 59 0 0 3
## pct_less_than_hs pct_hs pct_some_college pct_college
## 0 0 0 0
## pct_poverty pct_female pct_black pct_native
## 1 0 0 0
## pct_hispanic pct_asian pct_seniors pct_trump_2016
## 0 0 0 30
## pct_trump_2020 dem_governor state_mandate county_mandate
## 32 0 0 10
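Before choosing an imputation method, it can help to see how many rows would survive a complete-case analysis. The sketch below does that on a hypothetical four-row table and then applies median imputation to one column. Median imputation is only one simple option, shown as an assumption and not as the method used in this analysis, and `impute_median()` is a helper defined for this example.

```r
# Hypothetical toy data with a few missing values.
toy <- data.frame(
  density        = c(10, NA, 250, 40),
  pct_poverty    = c(12.1, 15.3, NA, 9.8),
  county_mandate = c(0, 1, NA, 0)
)

# Missing values per column (equivalent to the sapply() call above).
na_counts <- colSums(is.na(toy))

# Number of rows with no missing values at all.
n_complete <- sum(complete.cases(toy))

# Replace missing entries of a numeric vector with its observed median.
impute_median <- function(v) {
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
}
toy$density_imputed <- impute_median(toy$density)

print(na_counts)
print(n_complete)
print(toy)
```

Median imputation shrinks the variance of the imputed column, so for variables with many missing values a model-based approach may be preferable.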
In order to run the linear model, we had to change some of the transformations to log(1+X) to avoid taking the log of 0. Specifically, we had to do this for all four race/ethnicity variables and pct_college. We left out two of the education categories, pct_less_than_hs and pct_some_college, to avoid multicollinearity, but moving forward, it might be best to create a variable that sums pct_college and pct_some_college. We might consider doing the same thing with minority groups. Finally, we removed pct_trump_2020 from the model because the multicollinearity between it and the 2016 percentage was inflating the standard errors. We also removed pop_2019.
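Both points in this paragraph can be illustrated with a short sketch. The first part shows why log(1+X) is needed when a variable contains zeros. The second computes a variance inflation factor (VIF) by hand on simulated data in which two predictors are nearly copies of each other, the way the 2016 and 2020 vote shares are. The simulated values and the `vif_for()` helper are assumptions made for this example.

```r
# log(0) is -Inf, which lm() cannot use; log1p(x) computes log(1 + x).
log(0)     # -Inf
log1p(0)   # 0

set.seed(42)
n <- 500

# Simulated stand-ins: the second predictor is the first plus a little noise.
trump_2016 <- rnorm(n, mean = 60, sd = 15)
trump_2020 <- trump_2016 + rnorm(n, sd = 2)
other      <- rnorm(n)

# VIF of one predictor: 1 / (1 - R^2) from regressing it on the other predictors.
vif_for <- function(target, predictors) {
  r2 <- summary(lm(target ~ ., data = predictors))$r.squared
  1 / (1 - r2)
}

vif_2016  <- vif_for(trump_2016, data.frame(trump_2020, other))
vif_other <- vif_for(other, data.frame(trump_2016, trump_2020))

cat("VIF for trump_2016:", round(vif_2016, 1), "\n")
cat("VIF for other:     ", round(vif_other, 2), "\n")
```

A VIF far above 10 for one of a near-duplicate pair is a common rule-of-thumb signal that dropping one of the two will stabilize the standard errors.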
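# step() needs every candidate model to be fit on the same rows, so keep complete
# cases for the model variables (a stand-in until the missing values are imputed).
model_vars <- c("pct_mask", "ru_continuum", "density", "pct_hs", "pct_college",
                "pct_poverty", "pct_female", "pct_black", "pct_native", "pct_hispanic",
                "pct_asian", "pct_seniors", "pct_trump_2016", "dem_governor",
                "state_mandate", "county_mandate")
model_data <- na.omit(clean_data_2[, model_vars])

# Intercept-only model: lower bound of the stepwise search.
interceptmodel <- lm(pct_mask ~ 1, data = model_data)

# Main-effects model with the transformations described above.
fullmodel <- lm(pct_mask ~ ru_continuum + log(density) + pct_hs + log(1+pct_college) +
                  log(pct_poverty) + pct_female + log(1+pct_black) + log(1+pct_native) + log(1+pct_hispanic) +
                  log(1+pct_asian) + log(pct_seniors) + log(100-pct_trump_2016) + dem_governor +
                  state_mandate + county_mandate,
                data = model_data)

summary(fullmodel)

# Main effects plus all two-way interactions: upper bound of the stepwise search.
interactionmodel <- lm(pct_mask ~ (ru_continuum + log(density) + pct_hs + log(1+pct_college) +
                  log(pct_poverty) + pct_female + log(1+pct_black) + log(1+pct_native) + log(1+pct_hispanic) +
                  log(1+pct_asian) + log(pct_seniors) + log(100-pct_trump_2016) + dem_governor +
                  state_mandate + county_mandate)^2,
                data = model_data)

# Stepwise selection by AIC in both directions, starting from the main-effects model.
selected_model <- step(fullmodel, scope = list(lower = formula(interceptmodel), upper = formula(interactionmodel)),
                       direction = "both", trace = 0)

summary(selected_model)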